The 2012 Commodity Flow Survey (CFS) is a joint effort by the Bureau of Transportation Statistics and the U.S. Census Bureau, U.S. Department of Commerce. The survey is the primary source of national and state-level data on domestic freight shipments by establishments in mining, manufacturing, whole sale auxiliaries, and selected retail and services trade industries, located in the 50 states and the District of Columbia. The survey produces estimates on the type, origin and destination, value, weight, modes of transportation, distance shipped, and ton-miles of commodities shipped.
Additional information on the Commodity Flow Survey can be found at https://catalog.data.gov/dataset/commodity-flow-survey
## 'data.frame': 4547661 obs. of 20 variables:
## $ SHIPMT_ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ORIG_STATE : int 25 42 26 20 12 24 19 6 19 13 ...
## $ ORIG_MA : int 148 428 220 556 99999 47900 99999 41740 99999 99999 ...
## $ ORIG_CFS_AREA : Factor w/ 169 levels "00-00000","01-00000",..: 73 128 76 56 35 70 53 15 53 39 ...
## $ DEST_STATE : int 25 6 47 20 12 30 19 6 19 13 ...
## $ DEST_MA : int 148 41740 314 556 99999 99999 99999 99999 99999 99999 ...
## $ DEST_CFS_AREA : Factor w/ 132 levels "01-142","01-380",..: 56 11 106 43 27 67 41 14 41 30 ...
## $ NAICS : int 333 311 322 323 4235 337 337 4239 327 4237 ...
## $ QUARTER : int 2 3 2 1 3 2 3 1 3 2 ...
## $ SCTG : Factor w/ 52 levels "00","01","01-05",..: 42 42 33 35 40 49 32 49 37 41 ...
## $ MODE : int 14 14 4 4 5 4 5 14 5 5 ...
## $ SHIPMT_VALUE : int 2178 344 4197 116 388 3716 43738 77 338 145 ...
## $ SHIPMT_WGHT : int 11 11 5134 6 527 1132 13501 4 12826 22 ...
## $ SHIPMT_DIST_GC : int 14 2344 470 3 124 1942 16 272 10 21 ...
## $ SHIPMT_DIST_ROUTED: int 17 2734 579 3 201 2265 22 306 15 28 ...
## $ TEMP_CNTL_YN : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ EXPORT_YN : Factor w/ 2 levels "N","Y": 1 1 1 1 1 2 1 1 1 1 ...
## $ EXPORT_CNTRY : Factor w/ 4 levels "C","M","N","O": 3 3 3 3 3 1 3 3 3 3 ...
## $ HAZMAT : Factor w/ 3 levels "H","N","P": 2 2 2 2 2 2 2 2 2 2 ...
## $ WGT_FACTOR : num 208.5 193.3 51.2 238.7 398.1 ...
## SHIPMT_ID ORIG_STATE ORIG_MA ORIG_CFS_AREA
## Min. : 1 Min. : 0.00 Min. : 0 06-348 : 146728
## 1st Qu.:1136916 1st Qu.:17.00 1st Qu.: 312 17-176 : 104572
## Median :2273831 Median :29.00 Median : 476 27-378 : 88526
## Mean :2273831 Mean :29.11 Mean :38440 55-99999: 88185
## 3rd Qu.:3410746 3rd Qu.:42.00 3rd Qu.:99999 42-99999: 87779
## Max. :4547661 Max. :56.00 Max. :99999 06-488 : 81383
## (Other) :3950488
## DEST_STATE DEST_MA DEST_CFS_AREA NAICS
## Min. : 1.00 Min. : 104 06-348 : 162548 Min. : 212
## 1st Qu.:17.00 1st Qu.: 332 17-176 : 116536 1st Qu.: 326
## Median :29.00 Median : 488 36-408 : 107485 Median : 339
## Mean :28.88 Mean :40368 06-488 : 91410 Mean : 7130
## 3rd Qu.:42.00 3rd Qu.:99999 48-99999: 87025 3rd Qu.: 4241
## Max. :56.00 Max. :99999 42-99999: 86783 Max. :551114
## (Other) :3895874
## QUARTER SCTG MODE SHIPMT_VALUE
## Min. :1.000 35 : 318586 Min. : 0.000 Min. : 1
## 1st Qu.:1.000 24 : 288078 1st Qu.: 4.000 1st Qu.: 176
## Median :2.000 43 : 283551 Median : 5.000 Median : 875
## Mean :2.446 34 : 265539 Mean : 7.117 Mean : 18280
## 3rd Qu.:3.000 40 : 264089 3rd Qu.: 14.000 3rd Qu.: 6026
## Max. :4.000 33 : 237139 Max. :101.000 Max. :521277800
## (Other):2890679
## SHIPMT_WGHT SHIPMT_DIST_GC SHIPMT_DIST_ROUTED TEMP_CNTL_YN
## Min. : 1 Min. : 1.0 Min. : 1 N:4233061
## 1st Qu.: 15 1st Qu.: 17.0 1st Qu.: 21 Y: 314600
## Median : 270 Median : 110.0 Median : 140
## Mean : 37587 Mean : 386.1 Mean : 475
## 3rd Qu.: 8077 3rd Qu.: 538.0 3rd Qu.: 685
## Max. :277029600 Max. :5210.0 Max. :6932
##
## EXPORT_YN EXPORT_CNTRY HAZMAT WGT_FACTOR
## N:4361940 C: 54539 H: 169457 Min. : 0.2
## Y: 185721 M: 19367 N:4244682 1st Qu.: 93.0
## N:4361940 P: 133522 Median : 289.2
## O: 111815 Mean : 2311.0
## 3rd Qu.: 992.7
## Max. :975000.0
##
There are 4,547,661 observations of 20 variables.
Some interesting initial findings:
The states with the most outgoing shipments are California, Texas, Ohio, Pennsylvania, and New York. The states with the least outgoing shipments are Alaska, Wyoming, Delaware, New Mexico, and Vermont.
I would like to check out a histogram of the number of shipments for each origin state.
The distribution of number of outgoing shipments from each state is a right-skewed, long tail distribution.
The states with the most incoming shipments are California, Texas, New York, Florida, and Ohio. The states with the least incoming shipments are Rhode Island, Wyoming, Alaska, Vermont, and Delaware.
I would like to check out a histogram of the number of shipments for each destination state.
The distribution of number of incoming shipments to each state, similar to outgoing shipments from each state, is a right-skewed, long tail distribution.
## # A tibble: 72 x 2
## ORIG_MA n
## <int> <int>
## 1 99999 1630669
## 2 408 208768
## 3 348 146728
## 4 176 127993
## 5 148 124844
## 6 428 109581
## 7 378 88526
## 8 488 81383
## 9 122 79936
## 10 206 73944
## # ... with 62 more rows
## # A tibble: 71 x 2
## DEST_MA n
## <int> <int>
## 1 99999 1706285
## 2 408 236653
## 3 348 162548
## 4 176 131130
## 5 148 114853
## 6 428 94747
## 7 488 91410
## 8 206 83248
## 9 288 79953
## 10 122 77286
## # ... with 61 more rows
The top 5 origin and destination metro areas with the most outgoing/incoming shipments are the same: New York-Newark, Los Angeles-Long Beach, Chicago-Naperville, Boston-Worcester-Providence, and Philadelphia-Reading-Camden. Note: MA 99999 represents the remainder of numerous different areas, and was not counted here.
I would like to check out a histogram of the number of shipments for each origin and destination metro area.
The distribution of number of incoming shipments to each metro area and the distribution of number of outgoing shipments from each metro area both display a right-skewed, long tail distribution. This is what we would expect, given the distribution of shipment quantity for each state.
## # A tibble: 45 x 2
## NAICS n
## <int> <int>
## 1 325 221722
## 2 332 209430
## 3 4238 199770
## 4 311 186452
## 5 4244 175812
## 6 326 151976
## 7 333 147874
## 8 212 140933
## 9 323 139305
## 10 4236 138760
## # ... with 35 more rows
The industries with the most shipments are the Chemical Manufacturing Sector; Aerospace Sector; Machinery, Equipment, and Supplies Merchant Wholesalers; Food Manufacturing; and Grocery and Related Product Merchant Wholesalers. It appears America likes to ship chemicals, machines, and food!!
I would also like to view the distribution of shipments per industry.
The distribution of shipments for each industry appears relatively symmetric, with a bit of a tail to the right.
## # A tibble: 52 x 2
## SCTG n
## <fct> <int>
## 1 35 318586
## 2 24 288078
## 3 43 283551
## 4 34 265539
## 5 40 264089
## 6 33 237139
## 7 30 213796
## 8 29 200529
## 9 32 187159
## 10 07 183021
## # ... with 42 more rows
The industries with the most shipments are Electronic and Other Electrical Equipment and Components, and Office Equipment; Plastics and Rubber; Mixed Freight; Machinery; and Miscellaneous Manufactured Products. These largely align with NAICS Codes, with Plastics and Rubber aligning with the Chemical Manufacturing Sector, and both Electrical Equipment and Machinery aligning with Machinery.
# Determine the number of shipments for each quarter
arrange(count(cfs, QUARTER), desc(n))
## # A tibble: 4 x 2
## QUARTER n
## <int> <int>
## 1 1 1188440
## 2 2 1182850
## 3 3 1137376
## 4 4 1038995
The distribution of shipments is relatively even, with slightly fewer shipments each subsequent quarter.
The three modes of transit for each shipment with by far the most shipments are Private Truck; For-hire Truck; and Parcel, USPS, or Courier. This makes me wonder which modes are typically used for different attributes, such as distance, shipment value, shipment type, etc.
The distribution of shipment value is a right-skewed, long tail distribution. This distribution makes me want to use a logarithmic scale on the x-axis.
As expected, the logarithmic transformed shipment value distribution is relatively symmetric.
The distribution of shipment weight is right-skewed with a peak less than 100 pounds, and a secondary peak around 3 million pounds. This distribution makes me want to use a logarithmic scale on the x-axis.
This logarithmic transformed shipment value distribution looks relatively flat with a low and high peak.
The distribution of shipment distance routed is a right-skewed, long tail distribution. This distribution makes me want to use a logarithmic scale on the x-axis.
Unexpectedly, the logarithmic distribution is bimodal, with a peak around 10 pounds and a larger peak around 1,000 pounds. This makes me wonder what different attributes these two peaks have.
## # A tibble: 4 x 2
## EXPORT_CNTRY n
## <fct> <int>
## 1 N 4361940
## 2 O 111815
## 3 C 54539
## 4 M 19367
As expected, most (95.9%) shipments were not exports. Of the shipments that were exports, 29.4% went to Canada, 10.4% went to Mexico, and 60.2% wentto other countries.
## # A tibble: 3 x 2
## HAZMAT n
## <fct> <int>
## 1 H 169457
## 2 N 4244682
## 3 P 133522
93.3% of shipments are Not Hazmat (good to hear), 2.9% are Class 3.0 Hazmat (flammable liquids), and 3.7% are Other Hazmat.
There are 4,547,661 shipments with 20 characteristics (shipment identifier, origin state, origin metro area, origin CFS area, destination state, destination metro area, destination CFS area, NAICS industry classification, quarter shipment occurred, SCTG commodity code, mode of transportation, shipment value, shipment weight, great circle distance, routed distance, temperature control, export shipment, export destination, hazardous material, and weighting factor).
Some primary inquiries I have are:
I think most of the features can potentially be useful with this broad range of inquiries above.
I counted the occurrences of different categorical variables, but did not create any new ones.
Per above, the distribution of shipment weights, with peaks around 10 pounds and 1,000 pounds is interesting.
Most variables show no relation, but some interesting patterns:
## SHIPMT_ID ORIG_STATE ORIG_MA ORIG_CFS_AREA
## Min. : 195 Min. : 1.00 Min. : 0 48-288 : 2788
## 1st Qu.:1128301 1st Qu.:19.00 1st Qu.: 350 19-99999: 1737
## Median :2286056 Median :29.00 Median :99999 56-99999: 1653
## Mean :2278369 Mean :30.72 Mean :54799 48-99999: 1315
## 3rd Qu.:3423168 3rd Qu.:48.00 3rd Qu.:99999 17-99999: 1086
## Max. :4547643 Max. :56.00 Max. :99999 22-406 : 1082
## (Other) :35816
## DEST_STATE DEST_MA DEST_CFS_AREA NAICS
## Min. : 1.00 Min. : 104 48-288 : 2625 Min. : 212
## 1st Qu.:17.00 1st Qu.: 332 48-99999: 2163 1st Qu.: 311
## Median :29.00 Median :12940 22-406 : 1614 Median : 325
## Mean :30.33 Mean :45108 06-348 : 1393 Mean : 5393
## 3rd Qu.:47.00 3rd Qu.:99999 26-220 : 1118 3rd Qu.: 333
## Max. :56.00 Max. :99999 17-176 : 1053 Max. :551114
## (Other) :35511
## QUARTER SCTG MODE SHIPMT_VALUE
## Min. :1.000 20 : 4388 Min. : 0.00 Min. : 7
## 1st Qu.:1.000 24 : 3597 1st Qu.: 6.00 1st Qu.: 20561
## Median :2.000 15 : 3300 Median : 6.00 Median : 53088
## Mean :2.444 41 : 3011 Mean : 7.67 Mean : 598528
## 3rd Qu.:3.000 02 : 2824 3rd Qu.: 8.00 3rd Qu.: 150672
## Max. :4.000 12 : 2695 Max. :101.00 Max. :521277800
## (Other):25662
## SHIPMT_WGHT SHIPMT_DIST_GC SHIPMT_DIST_ROUTED TEMP_CNTL_YN
## Min. : 170254 Min. : 1.0 Min. : 1.0 N:43936
## 1st Qu.: 190928 1st Qu.: 46.0 1st Qu.: 64.0 Y: 1541
## Median : 212019 Median : 306.0 Median : 454.0
## Mean : 2819972 Mean : 451.5 Mean : 642.5
## 3rd Qu.: 580119 3rd Qu.: 730.0 3rd Qu.:1031.0
## Max. :277029600 Max. :4210.0 Max. :6791.0
##
## EXPORT_YN EXPORT_CNTRY HAZMAT WGT_FACTOR
## N:41564 C: 1219 H: 3897 Min. : 0.3
## Y: 3913 M: 936 N:37148 1st Qu.: 28.9
## N:41564 P: 4432 Median : 67.3
## O: 1758 Mean : 170.2
## 3rd Qu.: 160.2
## Max. :25676.9
##
Some initial interesting findings:
## # A tibble: 45 x 2
## NAICS avg
## <int> <dbl>
## 1 212 588041.
## 2 4245 403757.
## 3 324 354552.
## 4 551114 58268.
## 5 325 41900.
## 6 327 38276.
## 7 4247 32626.
## 8 4239 30116.
## 9 321 28020.
## 10 311 27680.
## # ... with 35 more rows
## # A tibble: 52 x 2
## SCTG avg
## <fct> <dbl>
## 1 15 5596597.
## 2 15-19 4362761.
## 3 99 2812023.
## 4 01-05 2801830.
## 5 14 1683167.
## 6 10-14 1187371.
## 7 02 357706.
## 8 20-24 299220.
## 9 17 270357.
## 10 06-09 257195.
## # ... with 42 more rows
Some observations related to shipment weight:
## [1] 5.465400e-01 1.074609e-01 8.947380e-02 7.617037e-02 6.047013e-02
## [6] 4.538558e-02 3.177430e-02 1.750335e-02 5.541263e-03 3.672186e-03
## [11] 3.518262e-03 2.660686e-03 2.330849e-03 2.198914e-03 1.957033e-03
## [16] 1.605207e-03 9.675220e-04 7.256415e-04 4.397827e-05
## [1] 0.5466138962 0.1578715919 0.1046613896 0.0543095866 0.0211081794
## [6] 0.0167106420 0.0153913808 0.0151715040 0.0151715040 0.0138522427
## [11] 0.0083553210 0.0074758135 0.0072559367 0.0061565523 0.0054969217
## [16] 0.0039577836 0.0004397537
Some observations related to shipment weight and transportation mode:
This makes me wonder how heavy export shipments are transported.
## SHIPMT_ID ORIG_STATE ORIG_MA ORIG_CFS_AREA
## Min. : 28 Min. : 1.00 Min. : 0 48-288 : 1971
## 1st Qu.:1132368 1st Qu.:18.00 1st Qu.: 288 06-348 : 1530
## Median :2263203 Median :32.00 Median : 482 26-220 : 950
## Mean :2270108 Mean :30.62 Mean :40819 42-99999: 846
## 3rd Qu.:3398338 3rd Qu.:45.00 3rd Qu.:99999 19-99999: 845
## Max. :4547516 Max. :56.00 Max. :99999 56-99999: 833
## (Other) :38502
## DEST_STATE DEST_MA DEST_CFS_AREA NAICS
## Min. : 1.00 Min. : 104 48-288 : 2404 Min. : 212
## 1st Qu.:17.00 1st Qu.: 294 06-348 : 2341 1st Qu.: 325
## Median :29.00 Median : 430 22-406 : 1928 Median : 333
## Mean :29.75 Mean :33852 48-99999: 1310 Mean : 10741
## 3rd Qu.:45.00 3rd Qu.:99999 26-220 : 1153 3rd Qu.: 4238
## Max. :56.00 Max. :99999 17-176 : 1134 Max. :551114
## (Other) :35207
## QUARTER SCTG MODE SHIPMT_VALUE
## Min. :1.000 34 : 5401 Min. : 0.000 Min. : 161552
## 1st Qu.:1.000 36 : 3549 1st Qu.: 4.000 1st Qu.: 198120
## Median :2.000 35 : 3387 Median : 4.000 Median : 266118
## Mean :2.432 30 : 2684 Mean : 6.165 Mean : 1070089
## 3rd Qu.:3.000 43 : 2623 3rd Qu.: 6.000 3rd Qu.: 484339
## Max. :4.000 21 : 2426 Max. :101.000 Max. :521277800
## (Other):25407
## SHIPMT_WGHT SHIPMT_DIST_GC SHIPMT_DIST_ROUTED TEMP_CNTL_YN
## Min. : 1 Min. : 1.0 Min. : 1.0 N:42292
## 1st Qu.: 13462 1st Qu.: 61.0 1st Qu.: 84.0 Y: 3185
## Median : 35224 Median : 322.0 Median : 445.0
## Mean : 2410927 Mean : 522.7 Mean : 679.2
## 3rd Qu.: 146578 3rd Qu.: 788.0 3rd Qu.:1041.0
## Max. :277029600 Max. :5012.0 Max. :6791.0
##
## EXPORT_YN EXPORT_CNTRY HAZMAT WGT_FACTOR
## N:37830 C: 1279 H: 2192 Min. : 0.2
## Y: 7647 M: 742 N:40824 1st Qu.: 18.2
## N:37830 P: 2461 Median : 39.8
## O: 5626 Mean : 111.0
## 3rd Qu.: 104.5
## Max. :25676.9
##
Some initial interesting findings:
## # A tibble: 45 x 2
## NAICS avg
## <int> <dbl>
## 1 336 192388.
## 2 324 138415.
## 3 4245 81859.
## 4 325 43281.
## 5 4242 38882.
## 6 4931 37794.
## 7 331 34882.
## 8 551114 34486.
## 9 311 24577.
## 10 333 23431.
## # ... with 35 more rows
## # A tibble: 52 x 2
## SCTG avg
## <fct> <dbl>
## 1 15-19 1380790.
## 2 35-38 1284831.
## 3 00 835268.
## 4 01-05 585695.
## 5 99 355346.
## 6 37 251358.
## 7 20-24 169013.
## 8 14 145027.
## 9 17 120425.
## 10 10-14 117273.
## # ... with 42 more rows
Some observations related to shipment value:
## [1] 0.5637355147 0.1131561009 0.1068012402 0.0463311124 0.0439342965
## [6] 0.0315544121 0.0287178134 0.0226708006 0.0183829188 0.0082899048
## [11] 0.0054313169 0.0031004684 0.0015172505 0.0013853157 0.0013853157
## [16] 0.0012533808 0.0008575764 0.0008575764 0.0004177936 0.0002198914
## [1] 0.3120052770 0.2198768690 0.0976253298 0.0952066843 0.0732189974
## [6] 0.0690413369 0.0457343887 0.0175901495 0.0173702726 0.0147317502
## [11] 0.0096745822 0.0085751979 0.0074758135 0.0048372911 0.0026385224
## [16] 0.0017590150 0.0013192612 0.0006596306 0.0004397537 0.0002198769
Some observations related to shipment value and transportation mode:
Air (incl truck & air), Truck and rail, and Truck and water are the three modes of transit with the highest mean shipment distance. Air (incl truck & air), and Truck and rail were the modes of transit with the highest median shipment distance.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0012 1.1429 1.2128 1.3955 1.3138 839.0000
The median and mean ratio between great circle distance and routed distance are 1.2128 and 1.3955, respectively.
##
## Call:
## lm(formula = cfs$dist_ratio ~ cfs$SHIPMT_DIST_GC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.14 -0.28 -0.17 -0.04 837.53
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.470e+00 2.608e-03 563.62 <2e-16 ***
## cfs$SHIPMT_DIST_GC -1.933e-04 3.768e-06 -51.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.617 on 4547659 degrees of freedom
## Multiple R-squared: 0.0005782, Adjusted R-squared: 0.000578
## F-statistic: 2631 on 1 and 4547659 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = cfs$dist_ratio ~ cfs$SHIPMT_DIST_GC + (log(cfs$SHIPMT_DIST_GC)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20 -0.27 -0.11 -0.01 837.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.824e+00 6.562e-03 278.01 <2e-16 ***
## cfs$SHIPMT_DIST_GC 7.475e-05 5.912e-06 12.64 <2e-16 ***
## log(cfs$SHIPMT_DIST_GC) -1.008e-01 1.714e-03 -58.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.615 on 4547658 degrees of freedom
## Multiple R-squared: 0.001338, Adjusted R-squared: 0.001337
## F-statistic: 3046 on 2 and 4547658 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = cfs$dist_ratio ~ cfs$SHIPMT_DIST_GC + (cfs$SHIPMT_DIST_GC^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.14 -0.28 -0.17 -0.04 837.53
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.470e+00 2.608e-03 563.62 <2e-16 ***
## cfs$SHIPMT_DIST_GC -1.933e-04 3.768e-06 -51.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.617 on 4547659 degrees of freedom
## Multiple R-squared: 0.0005782, Adjusted R-squared: 0.000578
## F-statistic: 2631 on 1 and 4547659 DF, p-value: < 2.2e-16
It appears there is an exponential relationship with great circle shipment distance and distance ratio. The linear, exponential, and quadratic models however, each with low correlation coefficients, say otherwise.
There are few correlations between quantitative variables in the dataset, but the most interesting findings are:
The correlation coefficient between shipment weight and shipment value was 0.33, per the pairplot.
## SHIPMT_ID ORIG_STATE ORIG_MA ORIG_CFS_AREA
## Min. : 1 Min. : 0.00 Min. : 0 06-348 : 55447
## 1st Qu.:1134824 1st Qu.:16.00 1st Qu.: 314 17-176 : 33094
## Median :2273367 Median :29.00 Median : 476 06-488 : 32424
## Mean :2273371 Mean :28.68 Mean :36101 36-408 : 31162
## 3rd Qu.:3410118 3rd Qu.:42.00 3rd Qu.:99999 34-408 : 30823
## Max. :4547661 Max. :56.00 Max. :99999 25-148 : 29408
## (Other):1444921
## DEST_STATE DEST_MA DEST_CFS_AREA NAICS
## Min. : 1.00 Min. : 104 06-348 : 55626 Min. : 212
## 1st Qu.:16.00 1st Qu.: 314 36-408 : 36617 1st Qu.: 327
## Median :29.00 Median : 476 17-176 : 34482 Median : 4236
## Mean :28.69 Mean :36610 06-488 : 31856 Mean : 8810
## 3rd Qu.:42.00 3rd Qu.:99999 25-148 : 31429 3rd Qu.: 4244
## Max. :56.00 Max. :99999 48-288 : 29084 Max. :551114
## (Other):1438185
## QUARTER SCTG MODE SHIPMT_VALUE
## Min. :1.00 43 : 122804 Min. : 0.000 Min. : 1
## 1st Qu.:1.00 19 : 103513 1st Qu.: 5.000 1st Qu.: 138
## Median :2.00 31 : 89503 Median : 5.000 Median : 490
## Mean :2.46 35 : 84115 Mean : 5.476 Mean : 16614
## 3rd Qu.:3.00 24 : 83471 3rd Qu.: 5.000 3rd Qu.: 2431
## Max. :4.00 07 : 82533 Max. :101.000 Max. :521277800
## (Other):1091340
## SHIPMT_WGHT SHIPMT_DIST_GC SHIPMT_DIST_ROUTED TEMP_CNTL_YN
## Min. : 1 Min. : 1.0 Min. : 1.00 N:1536312
## 1st Qu.: 29 1st Qu.: 6.0 1st Qu.: 7.00 Y: 120967
## Median : 407 Median : 11.0 Median :13.00
## Mean : 31098 Mean : 13.8 Mean :17.05
## 3rd Qu.: 8381 3rd Qu.: 20.0 3rd Qu.:25.00
## Max. :215926090 Max. :1690.0 Max. :49.00
##
## EXPORT_YN EXPORT_CNTRY HAZMAT WGT_FACTOR
## N:1640663 C: 1173 H: 105930 Min. : 0.3
## Y: 16616 M: 941 N:1458485 1st Qu.: 137.4
## N:1640663 P: 92864 Median : 406.6
## O: 14502 Mean : 2118.4
## 3rd Qu.: 1232.3
## Max. :975000.0
##
## dist_ratio
## Min. : 0.00118
## 1st Qu.: 1.00000
## Median : 1.21429
## Mean : 1.24275
## 3rd Qu.: 1.35000
## Max. :44.00000
##
## SHIPMT_ID ORIG_STATE ORIG_MA ORIG_CFS_AREA
## Min. : 19 Min. : 0.00 Min. : 0 27-378 : 22033
## 1st Qu.:1135294 1st Qu.:17.00 1st Qu.: 312 06-348 : 21761
## Median :2270721 Median :29.00 Median : 464 48-206 : 15130
## Mean :2273187 Mean :29.12 Mean :38056 55-99999: 13843
## 3rd Qu.:3409801 3rd Qu.:42.00 3rd Qu.:99999 17-176 : 11512
## Max. :4547655 Max. :56.00 Max. :99999 19-99999: 10851
## (Other) :418671
## DEST_STATE DEST_MA DEST_CFS_AREA NAICS
## Min. : 1.00 Min. : 104 48-99999: 25321 Min. : 212
## 1st Qu.:12.00 1st Qu.: 312 06-348 : 25318 1st Qu.: 326
## Median :30.00 Median : 476 48-206 : 24146 Median : 334
## Mean :29.43 Mean :37106 48-288 : 22352 Mean : 5951
## 3rd Qu.:48.00 3rd Qu.:99999 12-370 : 17319 3rd Qu.: 4236
## Max. :56.00 Max. :99999 08-216 : 12418 Max. :551114
## (Other) :386927
## QUARTER SCTG MODE SHIPMT_VALUE
## Min. :1.00 35 : 61620 Min. : 0.000 Min. : 1
## 1st Qu.:1.00 40 : 54726 1st Qu.: 4.000 1st Qu.: 156
## Median :2.00 30 : 41102 Median : 14.000 Median : 937
## Mean :2.43 34 : 39746 Mean : 9.534 Mean : 21547
## 3rd Qu.:3.00 24 : 36761 3rd Qu.: 14.000 3rd Qu.: 7616
## Max. :4.00 38 : 32056 Max. :101.000 Max. :521277800
## (Other):247790
## SHIPMT_WGHT SHIPMT_DIST_GC SHIPMT_DIST_ROUTED TEMP_CNTL_YN
## Min. : 1 Min. : 2 Min. :1001 N:491357
## 1st Qu.: 5 1st Qu.: 928 1st Qu.:1151 Y: 22444
## Median : 39 Median :1077 Median :1329
## Mean : 67678 Mean :1117 Mean :1386
## 3rd Qu.: 1158 3rd Qu.:1315 3rd Qu.:1593
## Max. :138300770 Max. :2345 Max. :1999
##
## EXPORT_YN EXPORT_CNTRY HAZMAT WGT_FACTOR dist_ratio
## N:459032 C: 14964 H: 7915 Min. : 0.2 Min. : 0.5835
## Y: 54769 M: 10544 N:501340 1st Qu.: 73.7 1st Qu.: 1.1611
## N:459032 P: 4546 Median : 232.0 Median : 1.2044
## O: 29261 Mean : 3014.4 Mean : 1.8746
## 3rd Qu.: 874.0 3rd Qu.: 1.2732
## Max. :975000.0 Max. :839.0000
##
The low and high peak for shipment distances appear to differ in shipment weight and shipment value.
The distribution of shipment weight for the low-distance (0 to 50 miles) and high-distance (1,000 to 2,000 miles) shipments appears to be very similar. It seems like this graph would be better if the count were a proportion of total count for each distance category, as this would account for the total number of shipments.
These three line graphs show that a higher proportion of high distance shipments have low shipment weights. At approximately 65 pound shipment weight, the proportion of low distance shipments surpasses that of high distance shipments.
The patterns of distance ratio behave pretty erratically. The only patterns that can be observed are the ratio decreasing as distance increases. The distance ratio for non Mexico/Canada exports also appear to be the highest of any group.
Now subset the data to only include the predominant modes of transport.
For-hire and Private Truck have the highest percentages of shipments until about 1,000 miles, where Parcel, USPS, or courier has the highest percentage of shipments.
Now subset the data to only include the predominany modes of transport.
Parcel, USPS, or Courier has the highest percentage of shipments up to about 70 pounds, and then falls precipitously to practically none of the shipments. For-Hire Truck and Private Truck have the highest percentage of shipments that weigh greater than 70 pounds. Air Travel has some packages up to about 70 pounds, and then falls to a very small percentage of packages.
The main findings relate to the mode of transport used:
I was surprised the ratio between great circle distance and routed distance behaved so erratically with exports. This might relate to the location where shipments can legally enter each country.
## SHIPMT_ID ORIG_STATE ORIG_MA ORIG_CFS_AREA DEST_STATE DEST_MA
## 1 472066 48 206 48-206 48 206
## DEST_CFS_AREA NAICS QUARTER SCTG MODE SHIPMT_VALUE SHIPMT_WGHT
## 1 48-206 336 3 37 11 521277800 91990
## SHIPMT_DIST_GC SHIPMT_DIST_ROUTED TEMP_CNTL_YN EXPORT_YN EXPORT_CNTRY
## 1 7 7 N N N
## HAZMAT WGT_FACTOR dist_ratio
## 1 N 3.5 1
The distribution of shipment value is a right-skewed long-tailed distribution with the majority of shipments falling below $1,000. The maximum valued shipment of $521,277,800 was a 91,990 pound ‘Transportation Equipment Manufacturing’ shipment delivered 7 miles by Air from Dallas-Fortworth to Dallas-Fortworth.
For-hire and Private Truck have the highest percentages of shipments for shipment distances up to about 1,000 miles, where Parcel, USPS, or courier has the highest percentage of shipments.
Parcel, USPS, or Courier has the highest percentage of shipments up to about 70 pounds, and then falls precipitously to practically none of the shipments. For-Hire Truck and Private Truck have the highest percentage of shipments that weigh greater than 70 pounds. Air Travel has a low percentage of packages up to about 70 pounds, and then falls to a very small percentage of packages.
The main conclusions drawn from the exploration of the Commodity Flow Survey dataset are:
Some challenges with the analysis relate to the large size (4,547,661 observations). As a result of the size, a 10,000 observation sample was used to explore different resource-heavy plots, such as the pairplot.
I would like to further explore different regressions that could be conducted with different variables in the dataset. Can mode of transit be predicted by any set of variables?